Nuclease genes occupy boundaries of genetic exchange between bacteriophages

Abstract Homing endonuclease genes (HEGs) are ubiquitous selfish elements that generate targeted double-stranded DNA breaks, facilitating the recombination of the HEG DNA sequence into the break site and contributing to the evolutionary dynamics of HEG-encoding genomes. Bacteriophages (phages) are well-documented to carry HEGs, with the paramount characterization of HEGs being focused on those encoded by coliphage T4. Recently, it has been observed that the highly sampled vibriophage, ICP1, is similarly enriched with HEGs distinct from T4’s. Here, we examined the HEGs encoded by ICP1 and diverse phages, proposing HEG-driven mechanisms that contribute to phage evolution. Relative to ICP1 and T4, we found a variable distribution of HEGs across phages, with HEGs frequently encoded proximal to or within essential genes. We identified large regions (> 10kb) of high nucleotide identity flanked by HEGs, deemed HEG islands, which we hypothesize to be mobilized by the activity of flanking HEGs. Finally, we found examples of domain swapping between phage-encoded HEGs and genes encoded by other phages and phage satellites. We anticipate that HEGs have a larger impact on the evolutionary trajectory of phages than previously appreciated and that future work investigating the role of HEGs in phage evolution will continue to highlight these observations.


INTRODUCTION
Genetic exchange, whether through sexual reproduction or horizontal transfer between organisms, is a major force underl ying evolution. Typicall y, genes are selected for w hen they provide a benefit to their organism's reproducti v e success. Howe v er, some genes and genetic elements select for their own propagation while being neutral or deleterious for the organisms that encode them. Gene dri v es are one such example of selfish genetic elements, and are defined as elements that bias the inheritance of certain genes beyond the 50:50 ratio predicted by Mendelian genetics ( 1 ). The capacity of gene dri v es to effect genetic change on a population le v el has made gene dri v es a topic of great interest, both for their technological potential and as one of the natural processes underlying evolution ( 2 ).
One of the most common forms of gene dri v e is homing endonuclease genes or HEGs. HEGs are ubiquitous, being found across all domains of life, and use an elegantly simple mechanism to dri v e their inheritance. HEG mobilization occurs first with sequence recognition by the DNA binding domain and subsequent cleavage of cognate loci that lack the HEG coding sequence by the endonuclease domain. Finally, the HEG encoding sequence serves as a template for homology-dir ected r epair, spr eading the HEG coding sequence to its cognate gene locus (3)(4)(5).
HEGs were initially discovered within self-splicing introns of fungal mitochondria, and were noted for frequently interrupting essential genes ( 6 ). By associating with selfsplicing introns (distinct from eukaryotic splicesomal introns), or inteins (peptide sequences that splice out of a protein following translation) HEGs are able to utilize essential genes as a safe site for their own conservation, without compromising essential functions ( 7 ). In at least one instance, a HEG has e v en been known to split a nucleotide reductase gene into two coding sequences that maintain their essential function as two distinct peptides ( 8 ). HEGs are now known to be highly abundant in bacteriophage genomes ( 9 ). It is not uncommon for bacteriophage HEGs to occur in the intronic or inteinic configurations for which HEGs are known in eukaryotes, but the majority of phage encoded HEGs are free standing genes ( 9 ).
At first consideration, the occurrence of HEG dri v es in phages may seem strange, as viruses are not known for Mendelian inheritance; howe v er, coinfection of a single cell with distinct yet related phage isolates facilitates the recombination between phage genomes resulting in the production of hybrid progeny virions. HEGs have been shown to determine gene inheritance during phage crosses, ensuring that the HEGs themselves, as well as nearby sequences, are inherited by the hybrid progen y ( 10 , 11 ). Bey ond governing allele inheritance, HEGs can also serve as raw material for genetic innovation. There are notable examples of neofunctionaliza tion among HEG-associa ted domains. For example, in Sacchar om y ces cer evisiae , the HO endonuclease responsible for mating-type switching shares a common ancestor with the WHO-element HEGs of Torulaspora and Lachancea yeasts ( 12 ). As another example, some isolates of the ICP1 group of phages possess the gene odn , which is HEG-deri v ed and acts to protect the phage from subviral parasites ( 13 ).
HEGs have been extensively characterized in T4 and related pha ges (14)(15)(16)(17)(18), b ut interest in HEGs has not kept pace with the massi v e e xpansion of bacteriophage genomic research ( 19 ). T4-like phages have been noted as having a particularly high density of HEGs, and T4 itself stands out e v en among its relati v es. 11% of T4's coding sequence is occupied by HEGs ( 9 ). While the density of homing elements in T4like phages is remar kab le, it is not unique. ICP1, a more recently established but now extensively characterized (20)(21)(22)(23)(24)(25)(26)(27)(28)(29) model phage that infects Vibrio cholerae , has been noted as having numerous putati v e HEGs ( 28 ). As ICP1 is frequently co-isolated with V. cholerae from cholera patient samples, se v eral genetically distinct ICP1 isolates have been identified ( 22 , 28 , 30 ). These ICP1 isolates are highly similar but have been documented to encode variable genes. Given the high density of HEGs present in the ICP1 genome and the breadth of sequenced ICP1 isolates, ICP1 is an apt model organism for studying the impact of HEGs on phage genomics and evolution.
Here, we show that related HEGs are enriched in distinct phage lineages, including the ICP1-like phages, where HEGs are shown to associate with essential genes and allele exchanges similar to those described in T4-like phages. In ICP1, we show that this association between HEGs and essential genes can lead to the splicing of proximal genes, which we predict occurs in other phages as well. In many cases, HEGs bracket essential gene modules, forming what we have termed HEG-islands. HEG-islands show a striking le v el of nucleotide sequence conservation between phages that otherwise lack substantial nucleotide sequence similarity, suggesting that HEGs facilitate horizontal gene transfer between distantly related phages. We also observe sequence repetition within HEGs and domain shuffling between different HEGs, as well as between HEG and non-HEG genes. These architectural features highlight potential mechanisms for rapid HEG evolution, facilitating the emergence of novel genes within phages and phage-related mobile elements. Taken together, our observations show that HEGs are a driving force in bacteriophage evolution.

Bacterial culture and strains
Experiments were conducted using Vibrio cholerae E7946 as a host (Supplementary Table S1). Bacterial cultures were grown with aeration at 37 • C in liquid LB medium. Cultur es wer e supplemented with str eptomycin (100 g / ml) and ampicillin (100 g / ml) for Esc heric hia coli cloning intermediates.

Phage infection and isolates
All ICP1 isolates (Supplementary Table S1) were stored as high titer stocks in STE buffer (100 mM NaCl, 10 mM Tris-Cl pH 8.0, and 1 mM EDTA). All bacteriophage infections were carried out at a multiplicity of infection of 2.5 when the host cell cultur e r eached an OD 600 of 0.3 and allowed to proceed until 16 min post-infection.

RNA isolation
V ibrio choler ae E7946 was grown and infected with ICP1 as described above. Infected samples were incubated at 37 • C with aeration for 16 min, then mixed 1:1 with icecold methanol and vortexed. Samples were pelleted at 7000 × g for 3 min a t 4 • C , washed with ice-cold 1x phosphatebuffered saline, and pelleted. The pellets were lysed in 200 l of TRI Reagent (Millipore / Sigma) and incubated at room temperature for 5 min. After incubation, 40 l of chloroform was added to each sample, samples were vortexed, and incuba ted a t room tempera ture for 10 min. The TRI Reagent-chloroform mixture was then centrifuged at 12 000 × g for 10 min at 4 • C to phase-separate the mixture. The aqueous phase was removed, added to a mixture of 110 l of 2-propanol and 11 l of sodium acetate pH 6.2, and mixed by vortexing. The precipitated RNA was pelleted from the mixture by centrifuga tion a t 12 000 × g for 15 min at 4 • C, washed twice with 75% ethanol, and the residual ethanol was evaporated from the washed pellets in a bead bath at 65 • C. The RNA was resuspended in 20-30 l of diethyl dicarbona te (DEPC) trea ted H 2 O and DNase trea ted according to the Turbo DNase kit (Thermo-Fisher). DNasetrea ted RNA concentra tion and purity were assessed by NanoDrop.

Phage genomic DNA isolation
100 l of a high titer phage stock ( > 10 10 plaque-forming units / ml) was treated with 1 l of DNaseI at 37 • C for 20 min to degrade the un-encapsidated DNA and subsequently hea t-inactiva ted a t 65 • C for 10 min. Phage genomic DNA was isolated following a modified version of the DNeasy gDNA Extraction Kit (Qiagen) as follows: 360 l of buffer ATL and 40 l of proteinase K were added to the DNasetr eated phage, vortex ed, and incuba ted a t 65 • C for 20 min. 8 l of RNase A was added to the mixtur e, vortex ed to mix, and incubated for 2 min at room temperature. 400 l of AL and 400 l of 100% ethanol were added to the mixture, the solution was vortexed, and the mixture was centrifuged through the spin collection column for 1 min at 6000 × g. 500 l of wash buffer AW1 was added to the column and centrifuged at 6000 × g for 1 minute. 500 l of wash buffer AW2 was added to the column and centrifuged at 20 000 × g for 3 min to dry the column. The DNA was eluted from the column with 50 l of dH 2 O after incubation at 65 • C for 5 min.
NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 3 3 Splice junction validation cDNA was synthesized using SuperScript III Re v erse Transcriptase (Invitrogen) as per the manufacturer's protocol. 1 g of DNase-treated RNA was mixed with 100 ng of random hexamers and 1 l of 10 mM dNTPs to a final volume of 13 l. The mixture was heated at 65 • C for 5 min and allowed to rest on ice for 1 minute. 4 l 5x first strand synthesis buffer, 1 l 0.1 M dithiothreitol, 1 l of RNase inhibitor, and 1 l of Superscript III re v erse transcriptase were added to each reaction. Re v erse transcription reactions were incuba ted a t 25 • C for 5 min, 50 • C for 60 min, and 70 • C for 5 min to inactivate the reaction. 1 l of RNase H was added and the reaction was incubated at 37 • C for 20 min, then incuba ted a t 65 • C for 10 min to hea t-inactiva te the enzyme. The inactivated reactions were purified with the Monarch PCR purification kit (NEB).
Standard Taq pol ymerase cDN A PCRs were carried out using 0.5 l of cDNA as template DNA. Oligonucleotide primers were designed to amplify across the identified splice junctions from Magic-BLAST ( 31 ) analysis, and nested PCRs were carried out to amplify the desired products from the total cDNA synthesis products. Secondary PCRs were purified with the Monarch PCR and DNA Cleanup Kit (NEB), ligated into the pCR2.1-TOPO TA cloning vector (In vitrogen), and transf ormed into XL1 Blue E. coli. Transformants wer e scr eened by colony PCR and the splice junction sequences were determined by Sanger sequencing.

MAGIC BLAST RNA-seq read mapping
P air ed-end 150 bp × 150 bp RNA sequencing reads of V. cholerae infected with ICP1 16 min after infection were downloaded from the NCBI sequence read archi v e (PRJNA609114). Reads were trimmed to remove residual adapters and low-quality bases using Trimmomatic (v0.36) ( 32 ) and mapped to the ICP1 2006 Dha E CR Cas2 3 r efer ence genome with MAGIC-BLAST (v1.5.0) using default settings. The resulting alignments were viewed with Integrated Genome Viewer (v2.3) ( 33 ) and regions of interest were manually identified for an intergenic drop in read coverage, corresponding to potential splice junctions for subsequent validation.

ICP1 and outgroup homing endonuclease domain prediction
Genomic sequences of each ICP1 isolate and the various outgroup phages were downloaded from NCBI using the Entrez e-utilities command line tools (for accession numbers, see Supplementary Tables S1 and S2). Prediction of putati v e HEGs for all downloaded phages was initially performed with HMMER (v3.3.2) ( 38 ) using hmmsearch against the Pfam database with a maximum e-value cutoff of 0.001 to detect proteins with the Pfam domains indicated in Supplementary Table S3. For proteins with T5orf172 domains, the MUG113 domain (PF13455) was also accepted, as the two profiles are highly related. ICP1 isolates were manually investigated using BLAST and EMBOSS-Needle to identify putati v e HEGs with frameshifts , deletions , or duplications in the coding sequence.

Custom HMM profile generation
Representati v e amino acid sequences of domains from the ICP1 CapR and IPA-HNH domain-containing proteins were aligned with MAFFT and the resulting alignment was converted into a profile HMM with hmmbuild. The resulting HMM profile was searched against the UniProtKB database (Release 2021 02) with jackhmmer for 2 iterations, r estricting r esults to bacterial viruses (taxid:28883). This search detected homologs outside of ICP1 phages, increasing the captured sequence di v ersity of the profile HMM. After two iterations, the resulting sequence alignment was downloaded as a profile HMM for further analysis.

Gene neighborhood analysis
To detect phages that encode homologs to the ICP1 HEG T5orf172 and IPA-HNH families, we first generated a multiple sequence alignment of the r epr esentati v e HEGs of each family encoded by ICP1. For T5orf172 HEGs, this included r epr esentati v es Gp53, Gp121, and Gp130. For HNH-IPA HEGs this included Gp59, the Gp114-115 fusion, and Gp209. The resulting alignment was converted to a PSSM, which was used as a query for two iterations of PSI-BLAST against the NCBI nr database, restricting the results to bacterial viruses (taxid:28883). This search resulted in a list of proteins with predicted similarity to the candidate HEG family. The genomes of phages encoding these proteins were downloaded from NCBI using Entrez e-utilities command line tools for further investigation. Any sequence less than 20 kb in length and all metagenome assemblies (most of the < 20 kb length sequences), were removed from the dataset. Of the remaining sequences, the full genbank file with annotations of each phage was downloaded from NCBI (accessed August 2021). To standardize genome annotation across all of the phages, downloaded genomes were annotated with Prokka (v1.12) ( 39 ) using the Pfam-A v34.0 database for predicted functional annotations and novel coding sequence annotations predicted with Prokka were added to the genomes for downstream analyses.
These annotated genomes were then searched for proteins containing either T5orf172 or IPA-HNH HEGs with hmmsearch and gene neighborhoods were defined as the DNA sequence of 3250 bp flanking the start and stop position of each T5orf172 or IPA-HNH HEG in the genome, extended to the bounds of any open reading frames which began or terminated beyond the allocated ±3250 bp sequence region. When this region exceeded the length of the genome, genomes were assumed to be circular and the search was expanded to the opposite end of the genome until the length criteria were met. The predicted protein domain content of the coding sequences encoded by the gene neighborhood was then annotated with hmmscan against the Pfam database, retaining predicted domains with a maximum e-value cutoff of 0.1. The predicted protein content of the gene neighborhoods was assessed based on these annota tions, as indica ted in Figure 2 . Full gene neighborhoods analyzed in the study can be found in Supplementary Tables S4 (T5orf172) and S5 (IPA-HNH).

HEG island and recombination prediction
The r epr esentati v e HEG islands reported in the manuscript were identified using BLASTn to detect large regions of high nucleotide similarity between less-similar phages. Putati v e HEG islands encoding phage structural genes identified in the gene neighborhood analysis were searched pairwise against distinct phages with BLASTn to identify r epr esentati v e e xamples of HEG island e xchange. Regions of homology were marked on the gene graphs when they shared greater than 60 percent nucleotide identity over a length of at least 125 bp, with a maximum e-value cutoff of 1.0e −5 . The r epr esentati v e HEG islands were graphed using DNA Features Viewer ( 40 ) and related sequences were shaded according to the BLASTn percent identity. HEG island functional annotation was performed with the HHpred w e b server against the Pfam database, allowing for functional prediction of domains with an e-value cutoff ⇐ 0.01 for annotation.

DNA binding domain prediction and enumeration
To predict the DNA binding domains of the T5orf172 HEGs, we first identified the bounds of the predicted T5orf172 domain with HMMER. Because HEG nuclease domains are generally present on one terminus of the protein, we trimmed all amino acids from the T5orf172 domain to the nearest terminus of the protein, resulting in two sequences per protein, one predicted to encode a DNA binding domain and a second of the identified T5orf172 domain. These sequence lists were individually clustered with Cdhit (v4.8.1) ( 41 ) to generate clusters of proteins at 90% sequence identity to remove highly related sequences and the clusters were converted into a multiple sequence alignment with MAFFT. A pproximatel y maxim um-likelihood phylogenetic trees were generated from the multiple sequence alignments with FastTree.
To predict the DNA binding domain(s) of the truncated sequence predicted to contain the DNA binding domains, we first used hmmsearch to identify proteins containing either the CapR domain (custom profile) or DUFs723 and 4397 (PF05265 and PF14344, respecti v ely), both of which ar e pr edicted to encode Zn-finger DNA binding domains similar to the CapR domain. Sequences that were predicted to contain a CapR domain with a maximum e-value of 1.0e −10 were annotated as Ca pR DN A binding domains. Sequences which did not meet this threshold but were predicted to encode either DUF723 or DUF4397 were annotated as DUF723 / DUF4397. Sequences that matched none of the three domains were further investigated with HH-suite3 (v3.2.0) ( 42 ), first aligning sequences against the UniRef30 (2022 02) target database with two iterations to generate a r epr esentati v e a3m alignment. This alignment was then searched against the Pfam database using HHsearch. DUF723 or DUF4397 domains identified from HHsear ch wer e annotated according to which domain was predicted at the lowest e-value, but the number of domains present was not quantified gi v en the difference in sensitivity between HHsearch and hmmsearch. Manual investigation of the HHsearch results also re v ealed the presence of RepA N domains present in some T5orf172 proteins, leading to the identification of the putati v e E. coli satellites.

Prediction of putative E. coli phage satellites
The DNA sequence of each of the three phage-encoded RepA N-T5orf172 proteins was searched using BLASTn against the NCBI nt database to identify proteins with high primary sequence similarity, indicati v e of potential recombination e v ents between phage and bacteria. Genomes encoding these homologous sequences were downloaded and the surrounding 12 500 bp on either end of the homologous sequence was extracted. The three putati v e satellite sequences that contained RepA N domain proteins with the highest le v els of homology to the previously identified phage RepA N-T5orf172 protein were manually investiga ted for fea tures indica ti v e of satellite function, using a combination of BLAST and the HHpred w e bserver using the Pfam-A database for functional prediction using an evalue cutoff ⇐ 0.01 for annotation.
First, the three sequences were searched against one another using BLASTn, identifying large regions of gapped homolo gy, a pproximatel y 13 kb in length. The bounds of these r egions wer e manually in vestigated f or in verted repea t sequences, indica ti v e of integrase-bound attachment sites. Genes encoded between the attachment sites were then functionally annotated using the HHpred w e b server searching the Pfam database to identify the prediction function of the encoded proteins. This search re v ealed two types of integrases and two distinct attachment sites present on the thr ee r epr esentati v e sequences. Using these identified sequences, we then subjected the remaining 393 putati v e satellite-encoding contigs to a series of BLASTn searches to determine whether the observed integrases and attachment sites wer e pr esent. Any putati v e element that contained less than 10 kb of non-contiguous homology to one of the three r epr esentati v e elements was discarded, leaving 224 potential satellites. These remaining satellites are summarized in Supplementary Table S6, where we predict the putati v e attachment sites, which family of integrase is encoded by the satellite, the bounds of the satellite, and the sequence of the attachment sites. These findings are not meant to be comprehensi v e but provide a list of potential sequences for further investigation.

ICP1 phages encode diverse families of HEGs
Pre viously, numerous putati v e HEGs were identified in ICP1 phages ( 28 ). While a few of these genes contained domains previously linked to HEG activity, such as HNH-3 and LAGLIDADG, a pproximatel y half of these putati v e HEGs encoded the less-characterized T5orf172 domain (Figure 1 A). T5orf172 is a subfamily of GIY-YIG endonucleases, bearing a similar core motif, and in at least one case, T5orf172 nuclease activity was shown to be dependent on a glutamate residue whose catalytic function was inferred from multiple sequence alignments with biochemically characterized GIY-YIG endonucleases ( 13 ). The broader GIY-YIG family has been associated with homing activity in phages, and the majority of HEGs in T4 possess a GIY-YIG domain (Figure 1 A) ( 9 , 17 ). To determine if a similar enrichment of T5orf172 HEGs occurs across the ICP1 isolates and phylo geneticall y related ICP1-like phages, we bioinformatically identified the putati v e HEGs across this group of phages, using profile HMM predictions to identify genes that encode putati v e HEG domains according to the Pfam database. This analysis confirmed an abundance of T5orf172 HEGs in ICP1 phages and demonstrated the di v ersity of HEGs present in otherwise phylo geneticall y related phages. Beyond having HEGassociated domains, ICP1's putati v e HEGs exhibit several genomic signatures characteristic of HEGs: their presence is variable between ICP1 isolates (Figure 1 A) ( 14 , 15 ); when pr esent, the genes ar e unsta ble due to frameshifts, varia ble stops, or in-frame deletions ( 14 ) (Figure 1 A); they often occur near or inside essential phage genes ( 43 ); and in the case of ICP1's T5orf172 genes, there is a high le v el of nucleotide similarity between related genes, suggesting the amplification and di v ersification of a common ancestor throughout ICP1 genomes over time. While the homing activity of ICP1's putati v e HEGs has not been e xperimentally v erified, these genomic signatures provide strong evidence for homing. For the sake of concision, we will refer to these putati v e HEGs simply as HEGs.
ICP1 phages also were found to encode HEGs belonging to numerous other HEG families (Figures 1 B-E). ICP1 isolates encode some HNH-family endonucleases with similar domain ar chitectur e to T4 HNH HEGs (Figur es 1 B and C). Within ICP1, gp28 , gp80 , and gp179 each encode an N-terminal HNH-3 domain and a C-terminal zincfinger domain similar to the Mob class of HEGs from T4 ( 44 ). Additionally, we identified three ICP1 genes, gp59 , gp115 and gp209 , with domains related to, but distinct from, the typical HEG-associated HNH-3 domains. While these domains are too di v ergent from canonical HNH-3 domains for profile predictions using the HNH-3 Pfam profile, HMM-HMM searches using HHSearch identified putati v e N-terminal HNH domains in each of the three proteins, and both gp59 and gp209 encode C-terminal AP2 DNA binding domains. Although gp115 does not encode any predicted DNA binding domain, gp116 immediately downstream encodes an AP2 domain. We reasoned that these two genes likely r epr esent an inactivated HEG due to a frameshift, as the fused gp115-gp116 contains the expected domain architecture, and we performed all subsequent analyses with this fused protein sequence. In support of these genes as atypical HNH HEGs, this combination of an N-terminal HNH domain and a C-terminal AP2 domain has been previously identified as a family of HEGs in protists, cyanobacteria, and phages ( 45 ). Based on the domain ar chitectur e of these genes and the previous characterization of related genes, we conclude that they r epr esent a class of HEGs that exists within the larger HNH superfamily and will refer to them as I CP1-like P hage A P2 domain associated HNH-domain HEGs or IPA-HNH HEGs.

T5orf172 and IPA-HNH HEGs are enriched in diverse phage lineages
As phage HEG r esear ch has been largely dominated by the study of T4-like phages, and T4 does not possess T5orf172 genes or IPA-HNH genes, we searched broadly among phages for HEGs homologous to those in ICP1, hoping to broaden our understanding of HEG di v ersity within and across phage genomes. Besides their abundance in the ICP1 isolates and closely related phages, both T5orf172 and IPA-HNH genes are enriched in multiple taxonomically di v erse linea ges. T5orf172 genes are ab undant in the Plateaulakeviruses, a clade of Aeromonas sp. infecting pha ges, T5-like pha ges, and an unclassified group of Vibrio infecting siphoviridae phages ( 46 ) that we have designated VpaJT-like phages for their sequence similarity to the phage VpaJT-1 ( 47 ) (Figure 2 A). IPA-HNH coding genes are abundant in separate lineages, namely the marine Barbaviruses, and the Pseudomonas sp. infecting Flaumdraviruses (Figure 2 B). The reasons for the enrichment of specific phage taxa with specific HEG families are unclear.
Possib le e xplanations include specific biological traits, such as a phage's ability to exclude co-infecting phages, and early invasion into a phage lineage by a specific HEG class.
We next constructed a phylogenetic tree of T5orf172 domain sequences to examine if the relationship between T5orf172 proteins also suggested expansion within specific pha ge linea ges (Supplementary Figure S1). Generally, T5orf172 sequences are more related to sequences from the same pha ge linea ge, b ut we also found evidence for some inter-lineage exchange of T5orf172 genes (Supplementary Figure S1). A small number of T5orf172 domains from ICP1 and ICP1-like phages occur within a clade comprised of sequences from T5-like phages, while about half of the T5orf172 genes in ICP1 form a distinct clade and the other half ar e mor e dispersed. These r esults indica te tha t HEG expansion within specific phage lineages is primarily responsible for HEG distrib ution, b ut ther e is an appr eciab le le v el of horizontal HEG transfer between different phage lineages.

HEGs cluster near essential genes in diverse phage lineages
Pre vious wor k inv estigating HEGs in phages has shown that HEGs often integrate near or within essential genes or operons ( 9 , 43 , 48-50 ), including those encoding structural genes, the terminase, ribonucleotide reductases, and the replisome. Using the previously identified genomes that encode T5orf172 and IPA-HNH HEGs, we performed a gene neighborhood analysis surveying the genes encoded within the flanking 3250 bp of the detected HEGs, aiming to assess whether a similar association occurs between the HEG families in our dataset and their integration sites (Figure 2 A and B). Consistent with previous work investigating HEGs in phages, we find that both T5orf172 and IPA-HNH coding genes often occur near or within essential genes and operons, including those for structural genes, terminase , ribonucleotide reductase , and the replisome HEGs are often localized in gene neighborhoods that include essential phage genes. Candidate phage genomes containing homologs to ICP1 r epr esentati v e nucleases were identified by PSI-BLAST and the gene neighborhoods containing the ± 3250 bp flanking each predicted HEG present were annotated for predicted function. Phylogenies of phage genomes were generated using the Viptree ( 34 ) algorithm to generate a proteomic tree of the phage relatedness based on tBLASTx. Gene neighborhoods including genes predicted to encode a phage terminase, capsid, tape measure protein (TMP), ribonucleotide reductase (RNR), or replicase component were indicated with a black box at the end of the leaf. The number of putati v e HEGs in each genome is r epr esented by the grayscale bar surrounding the tree, with all phages encoding at least one r epr esentati v e HEG. Clades with a high density of HEGs discussed throughout the manuscript are highlighted accor dingly. ( Figure 2 C-F). We surveyed 468 T5orf172 HEGs encoded by 204 distinct phage genomes, finding that 293 of the T5orf172 gene neighborhoods encoded at least one of he indicated essential marker genes. Only 40 of the 204 phage genomes did not harbor a T5orf172 HEGs in such gene neighborhoods, supporting well the enrichment of HEGs proximal to essential genes. T5orf172 HEGs were most frequently observed in gene neighborhoods containing a terminase gene (35% of neighborhoods), followed by ribonucleotide reductases (33%) and capsid genes (31%) (Figure  2 A, Supplementary Table S4). We observed similar but less pronounced trends when performing a parallel analysis on the 431 IPA-HNH HEGs from 240 phage genomes. 209 of these IPA-HNH HEGs were encoded within gene neighborhoods containing essential genes and 149 of the 240 genomes contained at least one IPA-HNH HEG within such gene neighborhoods. IPA-HNH HEGs were found to occur proximal to DN A pol ymerase genes (29%), ribonucleotide reductase genes (18%), and terminase genes (14%) (Figure 2 B, Supplementary Table S5). Despite this decr eased r elati v e frequency of IPA-HNH association with the indicated marker genes, specific clades of phages, such as the Barbaviruses, show a strong association between the HEGs and the indicated gene families (Figure 2 B, Supple- isolates share a similar ar chitectur e, with a single HNH-3 HEG between the two TMP fragments (Top gene graph). A single ICP1 isolate, ICP1 1992 M4 (ICP1 M4), encodes two intronic HNH-3 HEGs in the same locus, both unique from the HNH-3 HEG found in the majority of ICP1 isolates (bottom gene graph). Despite the considerable difference in intronic sequence, all ICP1 isolates undergo differential splicing to excise unique homing endonuclease genes from TMP transcripts (A), resulting in nearly identical TMP sequences. Percent nucleotide identity between the conserved sequences is indicated by the black par allelogr am between the gene graphs. All ICP1 isolates share conserved splicing of the large terminase ( terL ) (C), which is neighbored by a T5orf172 HEG. RNA-seq read mapping of total RNA isolated from ICP1 2006 Dha E (top read coverage graphs aligned to a gene graph of the TMP (A) or terminase (C) coding region) shows a distinct drop in reads in intronic regions compared to that of the exonic regions. The splice junctions of the exonic transcript are represented below the gene graph and exons are labeled E1-4 (TMP) or E1-2 ( terL ), with dashed lines r epr esenting sequences joined by splicing. Nucleotides highlighted in bold are the predicted start and stop codons of the annotated coding sequences on the gene gra phs, w hich are removed during the splicing process. (B, D) Agarose gels of PCR products generated following PCR across the splice sites of the TMP (B) or terL (D) sequence from gDNA and cDNA confirms splicing present in cDNA samples. Selected size markers from 2-Log DNA Ladder are indicated by the numbers to the left of the gels in base pairs, Sanger sequencing of cDNA products confirmed the exon mapping r epr esented in (A) and (C). Black inward-facing arrows r epr esent the primers used for both the cDNA and gDNA PCR reactions and the indicated size of the gDNA product is indicated in base pairs. mentary Table S5). Notably, not all phages identified in this analysis are expected to encode all of the indicated genes due to differences in phage and lifestyle. For example, many phages rely on cell-encoded replisomes rather than encoding their own. The absence of these features in some genomes likely results in an underrepresentation of the frequency of HEG integration proximal to the indicated genes.
When these gene neighborhoods were examined more closely, many interesting trends were observed. We often found HEGs of other classes near T5orf172 or IPA-HNH HEGs ( Figure 2 C and E), suggesting these loci may be favorable for HEG acquisition or retention. While the majority of HEGs observed appear to be free-standing genes, we did observe numerous cases where other genes were interrupted by a HEG sequence. At least 8 of the 20 LAGL-IDADG endonucleases within our dataset were predicted to be intein-encoded within a nr dA gene , similar to that of ICP1 (Figure 2 E). Another frequently interrupted gene encodes the tape measure protein (TMP), which serves as a scaff old f or tail tube assembly ( 51 ). ICP1's own TMP is interrupted by a HEG, consistent with other documented examples of intron-associated HEGs and spliced transcripts ( 6 , 52 , 53 ).
To determine whether the splicing of ICP1's interrupted TMP gene forms a full-length TMP gene, we analyzed previous RNA sequencing da ta genera ted during ICP1 infection ( 27 , 54 ) with the NCBI Magic-BLAST tool, which is designed for detecting candidate intronic sequences from RNA sequencing data sets ( 31 ). From this analysis, we de-tected the splicing of the multiple ICP1 transcripts (Figure 3 ). To confirm the splicing of the ICP1 TMP and determine the full exonic sequence, we used reverse transcriptase PCR (RT-PCR) of RNA isolated during phage infection coupled with Sanger sequencing. The ICP1 TMP coding transcript is comprised of four exons and three introns, with the middle intron typically encoding an HNH-3 HEG (Figure 3 A). Remar kab ly, the oldest ICP1 isolate sequenced thus far, ICP1 1992 Ind M4 (ICP1 M4), encodes variable intronic sequences while maintaining the splice junctions and undergoing only minimal changes to the surrounding exonic nucleotide sequence (Figure 3 B). Instead of a single HNH-3 HEG occupying the middle intron, ICP1 M4 has two HNH-3 HEGs, one in the first and third intron, while the middle intron is empty (Figure 3 A). Each of the HEGs shares a pproximatel y 30% amino acid identity with another, suggesting this locus tolerates the integration of distinct HEGs. HEG-dri v en gene conversion is well documented, but this example from ICP1 shows that the HEG content of introns can vary while the surrounding sequence context is nearly unchanged, adding another evolutionary route for the di v ersification of homing loci.
The Magic-BLAST analysis of RNA-seq data also revealed the splicing of ICP1's terminase gene (Figure 3 C), and the splicing of the terminase transcript was confirmed through RT-PCR analysis (Figure 3 D). While phage functional annotation pipelines identified the larger Cterminal exon as encoding a full-length large terminase, the N-terminal exon had not been gi v en a predicted function, as is commonly the case for short phage genes. Notably, ICP1 encodes a T5orf172 HEG adjacent to the terminase N-terminal exon rather than inside of the terminase intron (Figure 3 C). Such an arrangement between introns and HEGs has been observ ed pre viously in T3 and other T7-like phages, as well as a cyanophage, and is described as collaborati v e homing ( 55 , 56 ). In cases of collaborati v e homing, HEG endonucleolytic cleavage occurs at the intron insertion sequence, promoting the acquisition of both the HEG and the intron together. In such a scenario, it is not clear how the arrangement for collaborati v e homing would arise in the first place. An alternati v e hypothesis is that these collaborati v e homing loci occur due to HEG turnover, leading to the acquisition of a new free-standing HEG, and the purging of the intronic HEG, while leaving the intron splice junctions intact.

HEGs bracket essential genes and operons forming 'HEGislands'
Often, multiple HEGs are encoded near the same essential locus. This suggests that similar to collaborati v e homing between free-standing HEGs and introns, collaborati v e homing may also occur between neighboring HEGs with different specificities. Cooperation between HEGs could increase the mobility of each HEG individuall y w hile also helping to avoid replacement by foreign homing loci. We observed many instances of essential genes, and in some cases, entir e cor e operons, flanked on both ends by HEGs (Figure 4 , Supplementary Figure S2). This arrangement is suggesti v e of a mechanism wherein the flanking HEGs serve as vehicles to mobilize the essential loci, which are cargo of a complex homing region. In line with this model, we have named these regions HEG-islands. HEG-islands show remar kab le conservation between phages on the nucleotide le v el ( Figure  4 , Supplementary Figure S2). In phages that lack e xtensi v e nucleotide similarity across most of their genomes, we find HEG-islands with nucleotide identity greater than 90%, an incredib le le v el of conservation for cargo mobilized through a recombinational repair homing mechanism.
The HEG content of related HEG-islands is unstable. While the related HEG-islands we observed typically had HEGs in the same locations, these cognate HEGs could be homologs with shared nucleotide identity, HEGs of the same class without obvious nucleotide similarity, or HEGs of entir ely differ ent classes (Figur e 4 , Supplementary Figur e  S2). In one r epr esentati v e e xample, the tail operons (Figure 4 A), we observed the replacement of the commonly occurring HNH-3 HEG with a PD-(D / E)XK HEG, a family HEGs found in bacteria ( 57 , 58 ) but not found in ICP1 or most ICP1-like phages. HEG interchangeability may come about because multiple HEGs have evolved symbiotic rela tionships with rela ted cargo genes, increasing the inheritance of the HEGs w hile concurrentl y affecting the fitness of the host through maintenance or turnover of essential gene modules. Gi v en the essential functions of their cargo and their presence in widely di v erged phages, HEG-islands could be a major force of phage evolution.

HEGs and HEG-related proteins display repetitive domain ar chitectur e and domain shuffling
We next wanted to assess the domain ar chitectur e of the T5orf172 HEGs to gain more insight into their evolution. Pr eviously, it was r ecognized that the DNA binding domains of ICP1 T5orf172 HEGs have homology to CapR ( 59 ), a transcriptional r epr essor of ICP1's capsid expression encoded by phage-inducible chromosomal islandlike elements (PLEs). PLEs are sub-vir al par asites that exploit the ICP1 life cycle for their own mobilization ( 24-27 , 29 , 54 , 59-62 ), inhibiting ICP1 progeny formation. This antagonism is frequently met with counter-adaptation by the phage ( 13 , 21 , 23 , 63 ), providing a model system for the study of subcellular molecular co-evolution. The CapR DNA-binding domain was shown to be a Zn-finger domain with a CxxC. . . CxC motif ( 27 ). Using a custom HMM profile r epr esentati v e of the Ca pR DN A binding domain from all known PLEs, we found that many of the T5orf172 genes in our dataset ar e pr edicted to encode at least one CapR domain (95 / 229 T5orf172 HEGs, 41.5%), providing further support to the association between the CapR and T5orf172 domains. Of those that did not encode a predicted CapR domain, one of two other highly related DNA binding domains, DUF723 or DUF4397, w as almost alw ays present. All three profile HMMs are distinct from one another, but each contains a similar Zn-finger domain and we anticipa te tha t these domains each provide DNA binding activity to the HEG. In total, the vast majority of the identified T5orf172 HEGs (200 / 229, 87.3%) contained at least one of the three Zn-finger domains ( Figure 5 ). Interestingly, we found that many T5orf172 encoding genes encode multiple distinct Zn-finger domains (Figure 5 B). This repetiti v e domain organization is characteristic of class V repeat proteins. Class V repeat proteins can be described as 'beads on a string' as they consist of repetiti v e, indepen-dently folded and functional domains connected by linker sequence ( 64 ). Target recognition is then an emergent property of the different domains acting together to provide specificity of DNA binding. These repeat domains may act cooperati v ely to provide such specificity to T5orf172 HEG DNA binding or to regulate cleavage activity.
The modularity of DNA binding regions would lend itself to the rapid di v ersification of target recognition if modules can be readily exchanged or altered. Beyond their modularity, genes with r epeats pr esent on the nucleotide le v el may provide a favorable substrate for recombination. Even if the le v el of shared nucleotide identity between repeats is relati v ely low, recombinants would be expected to arise during phage infection. Gener ally, char acterized phage recombination systems can act on far lower le v els of homology and sequence similarity than is r equir ed for systems encoded by cellular organisms ( 65 , 66 ). Using a combination of BLASTn and manual curation, we identified degenerate repeats in the DNA binding coding regions of some ICP1 T5orf172 HEGs (Figure 6 A). Being heavily sampled and sequenced, ICP1 isolates provide a clear example of recombina tion between repea ts internal to a HEG, gp174 ( Figure  6 A). Recombination between degenerate repeats in ICP1 gp174 is predicted to have resulted in an in-frame 228bp deletion (Figures 1 A, 6 A). Previous work has demonstrated tha t small muta tions can cause HEGs to have large shifts in DNA substrate specificity ( 67 ). As such, this deletion would be predicted to have a strong effect on HEG specificity.
In addition to recombination between domains within a single HEG, we also detected evidence for recombination between separate HEG sequences. By generating nucleotide alignments of four HEGs from the Vibrio phages Helene, 1.187.F1 and 184E37.3a (See Supplementary Table S2 for genome accession numbers), we found evidence highly suggesti v e of domain swapping between the HEGs. Of these HEGs, a subset contained homologous DNA binding regions but unrelated T5orf172 domains, while others contained unrelated DNA binding regions and homologous T5orf172 domains (Figure 6 B and C). The homologydri v en recombination between HEGs supports that the domains of T5orf172-family HEGs are modular and can be exchanged. A pr edisposition for r ecombination is likely to make HEGs more e volvab le and may contribute to their success in certain phage lineages.

HEG-associated domains are engaged in phage-satellite conflicts
Horizontal transfer of HEG-associated domains is not limited to recombination within and between phages. Previous work has also established an evolutionary link between ICP1's T5orf172 HEGs and genes encoded by PLE, a satellite virus of ICP1. PLE-encoded CapR has similarity to some ICP1-encoded T5orf172 genes and is involved in PLE's parasitism and antagonism of ICP1 ( 59 ). The le v el of similarity between CapR and some ICP1-encoded T5orf172 HEGs is variable across different PLE variants and appears strongest in a more recently discovered PLE variant, PLE 7. In multiple capR alleles, the le v el of sequence homology to gp174 is greater than 70% across a 300 bp region, indicating a relati v ely recent genetic exchange between PLE and ICP1. Notably, both regions of the repeat sequence of gp174 , as well as another ICP1 T5orf172 HEG gp165 , share considerable nucleotide identity with capR (Figure 7 A, B), suggesting that similar recombination dynamics could be at play in the evolution of CapR and ICP1 HEGs.
Another characterized example of a HEG-related domain and a horizontally acquired domain being involved in the PLE-ICP1 arms race occurs with ICP1's o rigin d irected n uclease (Odn). Odn possesses a HEG-deri v ed T5orf172 nuclease domain, and a RepA N DNA binding domain similar to that of the RepA replication initiation factor encoded by se v eral PLEs ( 13 , 26 ). During infection, Odn binds and cleaves the cognate PLE origin of r eplication, pr e v enting PLE from parasitizing ICP1 ( 13 ). It is hypothesized that this specificity is an example of HEG domestication, wherein the ancestral Odn allele has recombined with the PLE RepA N DNA binding domain, increasing the fitness of phages containing Odn as they can inactivate PLE, bolstering phage replication (Figure 8 A). Consistent with this, some PLEs possess different RepA N containing replication initiation factors and cognate origins of replication, and these PLEs are immune to interference by Odn ( 13 ). When analyzing the T5orf172 DNA binding domains that lacked one of the three Zn-finger DNA binding domains (CapR, DUF723, or DUF4397) ( Figure 5 ), we found that odn was not the only ICP1 gene in the set that encoded a RepA N domain. A second ICP1 T5orf172 encoding gene, gp5 , also encodes a RepA N domain. The RepA N domain of Gp5 has sequence similarity to the RepA N domain of a bioinformatically identified MGE with considerable gene synteny to PLE identified from a non-O1 V. cholerae isola te (Figure 7 C , D). The le v el of amino acid similarity between the Gp5 RepA N domain and its cognate PLE domain is approximate to the le v el of similarity between Odn's RepA N domain and its own cognate PLE domain. Gp5 is frameshifted in over half of ICP1 isolates (Figure 1 A), which we hypothesize indicates that this protein encodes an accessory function for many ICP1 isolates, such as specific defense against PLEs containing this novel RepA / origin combination. These factors suggest that Gp5 may r epr esent a second example of the domestication of a HEG in the ICP1-MGE conflict, resulting in an ICP1 nuclease that is potentially directed against a PLE origin of replication.
Aside from the ICP1 RepA N domain containing T5orf172 proteins, our analysis of DNA binding domains associated with T5orf172 domains ( Figure 5 ) found proteins with similar domain ar chitectur e in three closely related E. coli phages, Inny, Muut, and Mt1B1 P17 (Figure  8 B). These proteins shared a similar domain architecture to Odn, raising the question of whether they may also function   In the absence of Odn, the phage satellite life cycle continues unperturbed, inhibiting phage progeny production. Following recombination to acquire Odn, phages are able to overcome the satellite and viable phages are generated from the infection. The PLE gene graph shows the features of PLE 1 and highlights characterized PLE features with la bels a bove the diagram. (B) Examples of phages that are predicted to encode novel odn genes. RepA N domains of putati v e odn genes are colored identically to their corresponding satellite in panel C. Regions of nucleotide identity greater than 60% are shaded accordingly. Each odn homolog encodes a second RepA N domain without detectable homology to cognate satellite repA genes, which is shown in black.

(C)
Putati v e phage satellites that ar e pr edicted to be targets of each respecti v e odn homolog indicated in panel B. Gene graphs are colored according to phage genome, which corresponds to the odn RepA domain in panel B. Predicted attL and attR recombinase attachment sites are indicated with red (tRNA integrated) and green ( lysR integrated) lines and marked L and R. Genes with a strong functional prediction from HHsearch analysis are labeled with gene names or predicted protein function. Dec = decoration, MCP = major capsid protein, Int = integrase, BP = base-plate hub, TMP = tape measure protein.
(D) A schematic of the hypothetical exchanges of domains between related phage odn homologs as anti-satellite nucleases. Consistent with this hypothesis, a BLASTn search of the DNA encoding the RepA N domain from each of the three phages against the NCBI nr database detected bacterial homologs of the RepA N encoding sequence. These bacterial homologs lacked a T5orf172 domain, supporting the idea that they may serve as target sequences for the coliphage encoded T5orf172 / RepA N domain proteins, similar to Odn and the PLE replication initiator, RepA (Figure 8 A, ( 13 )).
Further inspection of the genomic loci encoding the RepA N domain homologs re v ealed characteristic features of phage satellites. These putati v e satellites encode a sim-ilar suite of predicted proteins, are flanked by attachment sites, and lack the full suite of genes for autonomous phage pr ogeny pr oduction, suggesti v e of satellite function (Figure  8 C). Similar to the recently described cfPICIs ( 68 ), these satellites appear to encode genes r equir ed for capsid morphogenesis, such as major capsid proteins and decoration proteins. Ne v ertheless, these satellites lacked many of the other signature genes of the cfPICI group such as terminase genes, a prohead serine protease, an alpA homolog, head-tail adapter genes, and a Pri-Rep replication initiator ( 68 ). The new satellites also encode putati v e tape measure proteins and a predicted component of the phage-tail baseplate but lack the majority of genes r equir ed for tail morphogenesis.
Through BLASTn analysis searching the identified putati v e sa tellites tha t ma tch Inny, Muut, and Mt1b1 P17, we were able to identify a set of 224 putati v e satellites (Supplementary Table S6) encoded by E. coli and closely related organisms. We estimated the ends of most of the putati v e MGEs by detecting repetiti v e a ttachment sites a t the end of the elements. Most of the identified elements encode a tyrosine recombinase integrase next to one of the attachment sites and were found integrated next to tRNAs. A subset of the elements, r epr esented by CP057957 (Figure 8 C), lacked the most common putati v e attachment sites and were found to encode additional integration machinery, including a novel serine recombinase and attachment sites. These satellites were found to be occasionally integrated into novel loci compared to the tRNA-proximal elements, suggesti v e of integration into the novel site by the alternati v e integration machinery (Supplementary Table S6).
Much like PLEs, these satellites also exhibited modularity in their RepA N-terminal DNA-binding domain, suggesting the replacement of the satellite replicon components with a functionally analogous nucleotide sequence. Remar kab ly, this variability in satellite-encoded RepA N domains was mirrored by the phage nucleases; the homologous T5orf172 domain-containing proteins in Inny, Muut, and Mt1B1 P17 each encoded separate RepA N domains, each homologous to a unique satellite RepA protein (Figure 8 C, Supplementary Figure S3A-C). These observations strongly suggest that the three E. coli phage nucleases are another outcome of an evolutionary arms race between phages and their satellites and that odn genes have evolved multiple times in separate pha ge linea ges, highlighting the neofunctionaliza tion of HEG-associa ted nuclease domains to the benefit of their genomic hosts.
While the order in which these homologous odn genes arose cannot be determined, one can infer reasonable evolutionary steps between them. Inny and Muut have a high le v el of sequence similarity, while Mt1B1 P17 is much more di v erged (Supplementary Figure S3D). This leads to a model where the change in RepA N domains between Inny and Muut is an example of domain swapping within a genome, while the nuclease domain must have been horizontally transferred between the Mt1B1 P17 lineage and the Inny and Muut lineage before or after undergoing another DNA binding domain swap (Figure 8 D). This model illustrates the flexibility of HEG-associated domains. Capable of both in situ mutability and horizontal transfer, these nucleases readily e volv e ne w specificities and functions.

DISCUSSION
Here, we have provided one of the most extensive bioinformatic surveys of phage HEGs to date, with a particular emphasis on a new HEG lineage employing a T5orf172 nuclease domain. Our observations show HEGs are at the sites of genetic exchanges, not only between phages but between phage and satellite pairs as well. Of particular note, we find that HEGs bracket essential or otherwise highly selected for gene clusters that are well-conserved at the nucleotide le v el despite occurring in phage genomes that are related onl y distantl y. We propose these clusters to be a new class of mobile element, the HEG-island. HEG-islands appear to be complex elements composed of two or more flanking HEG genes that provide homing activity and cargo genes that bear key functions of the phage lifecy cle. Gi v en the e xtremely important roles that their cargo genes play, we expect HEG-islands to be a driving force in phage evolution.
Unlike most MGEs, HEG-islands are not expected to have discrete boundaries. HEG-islands are likely to mobilize the totality of their cargo through successi v e homing e v ents rather than as a single reaction, and partial HEGisland mobilization may explain why some HEG-islands show partial and polar homology to each other ( Figure  4 ). Collaborati v e homing between flanking HEG sequences would be expected to provide multiple benefits. In a simple scenario, a second proximal HEG could provide a second line of homing activity if one HEG's target site is absent, or if a competing HEG is homing into the HEG-island. The organization of gene modules into mobile islands that are bracketed by HEGs may be linked to the functional requirements of those modules. The HEG-islands we've identified encode for essential heteromeric protein assemblies. Gene conversion of the entire module may be selected for, as conversion of only some genes within a module could bring together alleles that are not functionally compatible, increasing the risk of inviable pro geny w hen onl y one of the flanking HEGs is able to home into the target sequence. Additionall y, m ultiple HEGs may be a r equir ement for HEG-island homing between distantly related phages. Homing relies on homologous recombination, and for homing e v ents between distantly related pha ges, the b ulk of the template suitable for homologous recombination would likely be internal to the HEG-islands themselves. In T4, homing is known to be a highly asymmetric process ( 69 ).
Successi v e asymmetric recombination e v ents might allow HEG-islands to mobilize into new phage genomes through homology internal to the islands themselves.
HEG-islands complicate the traditional view of HEGs as selfish MGEs. While HEGs are clearly antagonistic towards the alleles they target, homing is m utuall y beneficial for HEGs and their cargo genes. Organization of cargo into HEG-islands that persist across distantly related phages suggests a high le v el of coe volution between HEGs and cargo genes. Rather than simply being fortunate bystanders mobilized by the action of a sub-genomic parasite, cargo genes are likely an essential part of HEG-islands, providing homology needed for homing as well as critical functions that select for maintenance of the HEG-island. The turnover of HEG-islands is likely rare and punctuated. Across nearly two decades of sampled ICP1 isolates, we see fairly limited HEG turnover and variation (Figure 1 A).
Howe v er, analysis of some ICP1-like phage genomes suggests relati v ely recent e xchanges of multiple HEG islands ( Figure 4 ). The exchange of multiple islands encoding essential functions suggests tha t, a t one extreme, some phages can be seen as loose consortiums of HEG-islands which are themselves consortiums of cargo and HEGs.
It is unclear just how ubiquitous HEGs and HEG-islands are among bacteriophages. Our analysis of T5orf172 HEGs and IPA-HNH HEGs re v eals se v eral HEG dense phage lineages (Figure 2 ), while pre vious wor k has shown HEGs to